How can a self-attention layer learn convolutional filters?
Self-attention has had a great impact on text processing and has become the de-facto building block for Natural Language Understanding (NLU).
But this success is not restricted to text (or 1D sequences): transformer-based architectures can beat state-of-the-art ResNets on vision tasks [1, 2].
In an attempt to explain this achievement, our work [3] shows that self-attention can express a CNN layer and that convolutional filters are learned in practice.
We provide an interactive website to explore our results.
Figure 1: Attention scores for a query pixel (black central square) for 9 heads (plotted separately). What if, at every query pixel, each attention head could attend to a single pixel (red) at an arbitrary shift?
Then the self-attention layer could express a convolutional filter of size 3 × 3.
We show that a multi-head self-attention layer has the capacity to attend to such a pattern and that this behavior is learned in practice.
Multi-head self-attention is a generalization of the convolutional layer.
The transformer architecture introduced by Ashish Vaswani and colleagues [4] has become the workhorse of Natural Language Understanding.
The key difference between transformers and previous methods, such as recurrent neural networks (RNN)
and convolutional neural networks (CNN), is that transformers can
simultaneously attend to every word of their input sequence.
Recently, researchers at Google AI successfully applied the transformer architecture to images [1, 2]. This line of work led to our paper [3] (under review), in which we investigate how transformers process images.
Specifically, we show that a multi-head self-attention layer with a sufficient number of heads is at least as expressive as any convolutional layer. Our finding presents a possible explanation for the success of transformers on images.
Self-Attention & Convolutional Layers
To point out the similarities and differences between a convolutional layer and a self-attention layer, we first recall how each of them processes an image of shape W × H × D_in.
The mechanisms behind CNNs are well understood; lifting the transformer architecture from 1D (text) to 2D (images), however, requires a good grasp of self-attention mechanics.
You can refer to Attention is All You Need [4] for a refresher on self-attention.
A Convolutional Neural Network (CNN) is composed of many convolutional layers and subsampling layers.
Each convolutional layer learns convolutional filters of size K × K, with input and output dimensions D_in and D_out, respectively.
The layer is parametrized by a 4D kernel tensor W of dimension K × K × D_in × D_out and a bias vector b of dimension D_out.
The following figure depicts how the output value of a pixel q is computed.
In the animation, we consider each shift Δ ∈ [−⌊K/2⌋, …, ⌊K/2⌋]² of the kernel separately.
This view might be unconventional, but it will prove helpful later when we compare convolutional and self-attention layers.
Figure 2: Illustration of the computation of the output value at a given pixel (blue) for a K × K convolution.
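This shift-by-shift view of a convolution can be sketched in a few lines of NumPy (a minimal sketch under our own assumptions of unit stride and zero padding; the function name is ours):

```python
import numpy as np

def conv2d_by_shifts(X, W, b):
    """K×K convolution computed as a sum over kernel shifts, as in Figure 2.

    X: input image of shape (H, W_img, D_in)
    W: kernel tensor of shape (K, K, D_in, D_out)
    b: bias vector of shape (D_out,)
    """
    K = W.shape[0]
    H, W_img, _ = X.shape
    pad = K // 2
    Xp = np.pad(X, ((pad, pad), (pad, pad), (0, 0)))  # zero padding
    out = np.zeros((H, W_img, W.shape[-1]))
    # Each shift contributes X[q + delta] @ W_delta to the output at pixel q.
    for d1 in range(-pad, pad + 1):
        for d2 in range(-pad, pad + 1):
            shifted = Xp[pad + d1 : pad + d1 + H, pad + d2 : pad + d2 + W_img]
            out += shifted @ W[d1 + pad, d2 + pad]
    return out + b
```

Summing one D_in × D_out matrix product per shift is exactly the decomposition the animation walks through.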
Multi-Head Self-Attention Layer
The main difference between CNN and self-attention layers is that the new value of a pixel depends on every other pixel of the image.
As opposed to convolutional layers, whose receptive field is the K × K neighborhood grid, the self-attention's receptive field is always the full image.
This brings scaling challenges when applying transformers to images, which we do not cover here.
For now, let's define the multi-head self-attention layer.
A self-attention layer is defined by a key/query size D_k, a head size D_h, a number of heads N_h, and an output dimension D_out.
The layer is parametrized by a key matrix W_key^(h), a query matrix W_qry^(h), and a value matrix W_val^(h) for each head h, along with a projection matrix W_out used to assemble all heads together.
Figure 3: Computation of the output value of a queried pixel (dark blue) by a multi-head self-attention layer. The top right displays examples of attention probabilities for each head; red positions denote the "center of attention".
The computation of the attention probabilities is based on the input values X.
This tensor is often augmented (by addition or concatenation) with positional encodings to distinguish between pixel positions in the image.
The hypothetical examples of attention probability patterns illustrate dependencies on pixel values and/or positions:
- one uses the values of the query and key pixels,
- one only uses the key pixel positional encoding,
- one uses both the values of the key pixels and their positions.
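Putting the definitions above together, the forward pass of a multi-head self-attention layer can be sketched as follows (a minimal NumPy sketch over T flattened pixels, without positional encoding; all names are ours):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, W_qry, W_key, W_val, W_out):
    """X: (T, D_in) flattened pixels.
    W_qry, W_key: (N_h, D_in, D_k); W_val: (N_h, D_in, D_h);
    W_out: (N_h * D_h, D_out) assembles all heads together."""
    heads = []
    for Wq, Wk, Wv in zip(W_qry, W_key, W_val):
        # every query pixel attends to every pixel of the image
        scores = (X @ Wq) @ (X @ Wk).T / np.sqrt(Wq.shape[-1])
        A = softmax(scores)              # attention probabilities, (T, T)
        heads.append(A @ (X @ Wv))       # per-head output, (T, D_h)
    return np.concatenate(heads, axis=-1) @ W_out
```

Note that each head produces a full T × T attention map, which is why the receptive field is always the whole image.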
Reparametrization
You might already see the similarity between self-attention and convolutional layers.
Let's assume that each pair of key/query matrices, W_key^(h) and W_qry^(h), can attend specifically to a single pixel at any shift Δ (producing an attention probability map similar to Figure 1).
Then each attention head would learn a value matrix W_val^(h) analogous to the convolutional kernel W_Δ (both in green on the figures) for each shift Δ. Hence, the number of pixels in the receptive field of the convolutional kernel is related to the number of heads by N_h = K × K.
This intuition is stated more formally in the following theorem (proved in our paper [3]).
Theorem
A multi-head self-attention layer with N_h heads of dimension D_h, output dimension D_out, and a relative positional encoding of dimension D_p ≥ 3 can express any convolutional layer of kernel size √N_h × √N_h and min(D_h, D_out) output channels.
The two most crucial requirements for a self-attention layer to express a convolution are:
- having multiple heads to attend to every pixel of a convolutional layer's receptive field,
- using relative positional encoding to ensure translation equivariance.
The first point might offer a first explanation of why multi-head attention works better than single-head attention.
Regarding the second point, we next give insights into how to encode positions so that self-attention can compute a convolution.
Relative Positional Encoding
A key property of the self-attention model described above is that it is equivariant to reordering, that is, it gives the same output independently of how the input pixels are shuffled.
This is problematic in cases where we expect the order of the input to matter.
To alleviate the limitation, a positional encoding is learned for each token in the sequence (or pixel in an image), and added to the representation of the token itself before applying self-attention.
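The equivariance to reordering is easy to verify numerically with a single-head sketch that omits positional encoding (all names below are ours):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention without positional encoding."""
    A = softmax((X @ Wq) @ (X @ Wk).T / np.sqrt(Wq.shape[1]))
    return A @ (X @ Wv)

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 4))          # 8 "pixels", no position information
Wq, Wk, Wv = (rng.standard_normal((4, 4)) for _ in range(3))
perm = rng.permutation(8)

# Shuffling the input rows only shuffles the output rows the same way:
assert np.allclose(self_attention(X[perm], Wq, Wk, Wv),
                   self_attention(X, Wq, Wk, Wv)[perm])
```

Without positional information, the layer has no way to tell which pixel is where, which is exactly why the encoding is needed.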
The attention probabilities (Figure 3, top right) are computed based on the input values and the positional encoding of the layer input.
We have already seen that each head can focus on a different part (position or content) of the image for each query pixel.
We can explicitly decompose these different dependencies as follows:
With an absolute positional encoding P_p added to each pixel representation X_p, the attention score between a query pixel q and a key pixel k expands into four terms:

(X_q + P_q)⊤ W_qry⊤ W_key (X_k + P_k)
  = X_q⊤ W_qry⊤ W_key X_k   (only depends on the content of the key and query pixels)
  + X_q⊤ W_qry⊤ W_key P_k   (query content, key position)
  + P_q⊤ W_qry⊤ W_key X_k   (query position, key content)
  + P_q⊤ W_qry⊤ W_key P_k   (only depends on the positions of the key and query pixels)
Because the receptive field of a convolution layer does not depend on the input data, only the last term is needed for the self-attention to emulate a CNN.
An important property of CNNs that we are still missing is equivariance to translation.
This can be achieved by replacing the absolute positional encoding with a relative positional encoding r_δ.
This encoding was first introduced by Zihang Dai and colleagues in Transformer-XL [5].
The main idea is to only consider the position difference δ = k − q between the key pixel (the pixel we attend to) and the query pixel (the pixel whose representation we compute), instead of the absolute position of the key pixel.
The absolute attention probabilities can then be rewritten in a relative manner (refer to the paper for the new matrix and vector parameters):

A_rel(q, k) = X_q⊤ W_qry⊤ W_key X_k + X_q⊤ W_qry⊤ Ŵ_key r_δ + u⊤ W_key X_k + v⊤ Ŵ_key r_δ
In this manner, the attention scores only depend on the shift and we achieve translation equivariance.
Finally, we show that there exists a set of relative positional vectors of dimension D_p = 3, along with self-attention parameters, that allow attending to pixels at an arbitrary shift (Figure 1). We conclude that any convolutional filter can be expressed by a multi-head self-attention layer under the conditions stated in the theorem above.
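The idea behind this construction can be illustrated with attention scores that are quadratic in the shift: scoring each key pixel as −α‖δ − Δ‖² makes the softmax concentrate on the pixel at the target shift Δ as α grows, and expanding the square shows the score is linear in the three features (‖δ‖², δ₁, δ₂) up to a per-query constant, matching the D_p = 3 requirement. Below is a sketch of this behavior (our own code, not the paper's exact parametrization):

```python
import numpy as np

def shift_attention_probs(H, W, query, target_shift, alpha=20.0):
    """Attention of one query pixel over an H×W grid with scores
    -alpha * ||delta - target_shift||^2, where delta = key - query.
    For large alpha, the softmax puts almost all its mass on the pixel
    at the desired shift, reproducing the pattern of Figure 1."""
    qi, qj = query
    scores = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            d1 = i - qi - target_shift[0]
            d2 = j - qj - target_shift[1]
            scores[i, j] = -alpha * (d1 * d1 + d2 * d2)
    e = np.exp(scores - scores.max())
    return e / e.sum()
```

For instance, a query at the grid center with target shift (1, −1) yields an attention map that is essentially one-hot at the pixel one row down and one column left.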
Learned Attention Patterns
Even though we proved that self-attention layers have the capacity to express any convolutional layer, this does not necessarily mean that the behavior occurs in practice. To verify our hypothesis, we implemented a fully-attentional model of 6 layers, with 9 heads each.
We trained it with a supervised classification objective on CIFAR-10 and reached an accuracy of 94% (not state of the art, but good enough).
We re-used the learned relative positional encoding from Irwan Bello and colleagues [1], learning row and column offset encodings separately.
The main difference is that we only used the relative positions to condition the attention probabilities, not the input values.
The attention probabilities displayed in Figure 4 show that self-attention does indeed behave similarly to convolution.
Each head learns to focus on different parts of the image; significant attention probabilities are in general very localized.
Figure 4: Attention maps of each head (column) at each layer (row) using learned relative positional encoding. The central black square is the query pixel. We reordered the heads for visualization.
We can also observe that the first layers (1-3) concentrate on very close and specific pixels, while deeper layers (4-6) attend to more global patches of pixels over whole regions of the image.
In the paper [3], we further experimented with more heads and observed more complex (learned) patterns than a grid of pixels.
In summary, we showed that a self-attention layer can express any convolutional filter given enough heads and relative positional encoding.
This may be a first explanation of why multiple heads are necessary.
In fact, the multi-head self-attention layer generalizes the convolutional layer: it learns the positions of its receptive field over the whole image (instead of using a fixed grid).
The receptive field can even be conditioned on the values of the input pixels; we leave this interesting feature for future work.
We hope that our findings on positional encoding for images can also be useful for text: learned relative positional encodings of dimension 2 seem sufficient, but this remains to be verified in practice.
Acknowledgments
To lighten this article and point the reader only to the most relevant papers, we cited only a subset of the relevant work that we built on. Please refer to the bibliography of the original paper [3] for the complete list.
The article template is due to distill.pub and many formatting styles are inspired from other articles appearing on Distill.
Attention Augmented Convolutional Networks[PDF] Bello, I., Zoph, B., Vaswani, A., Shlens, J. and Le, Q.V., 2019. CoRR, Vol abs/1904.09925.
Stand-Alone Self-Attention in Vision Models[PDF] Ramachandran, P., Parmar, N., Vaswani, A., Bello, I., Levskaya, A. and Shlens, J., 2019. CoRR, Vol abs/1906.05909.
On the Relationship between Self-Attention and Convolutional Layers[link] Cordonnier, J., Loukas, A. and Jaggi, M., 2020. International Conference on Learning Representations.
Attention is All you Need[link] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L. and Polosukhin, I., 2017. Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pp. 5998--6008.
Transformer-XL: Attentive Language Models beyond a Fixed-Length Context[link] Dai, Z., Yang, Z., Yang, Y., Carbonell, J.G., Le, Q.V. and Salakhutdinov, R., 2019. Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, pp. 2978--2988.